Policy Optimization Methods in Reinforcement Learning 3 - DPG & DDPG

Deterministic Policy Gradient Algorithms (DPG)

First, let's briefly review the stochastic policy gradient.

Here we denote the cumulative reward under policy \(\pi\) by \(J(\pi)=\mathbb{E}[r^\gamma_1|\pi]\), the probability of transitioning from state \(s\) to state \(s'\) in \(t\) steps by \(p(s\rightarrow s',t,\pi)\), and the discounted state distribution by \(\rho^\pi(s')=\int_{\mathcal{S}}\sum^\infty_{t=1}\gamma^{t-1}p_1(s)p(s\rightarrow s',t,\pi)\mathrm{d}s\). Then \[ \begin{align} J(\pi_\theta)&=\int_\mathcal{S}\rho^\pi(s)\int_\mathcal{A}\pi_{\theta}(a|s)r(s,a)\mathrm{d}a\mathrm{d}s\\ &=\mathbb{E}_{s\sim\rho^\pi,a\sim\pi_\theta}[r(s,a)] \end{align} \] The stochastic policy gradient theorem is \[ \begin{align} \nabla_\theta J(\pi_\theta) &= \int_\mathcal{S}\rho^\pi(s)\int_\mathcal{A}\nabla_\theta\pi_\theta(a|s)Q^\pi(s,a)\mathrm{d}a\mathrm{d}s \\ &=\mathbb{E}_{s\sim\rho^\pi,a\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)Q^\pi(s,a)] \end{align} \]
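As a concrete illustration of the expectation form above, here is a minimal NumPy sketch of the score-function (REINFORCE-style) estimator for a linear-softmax policy over discrete actions; the trajectory format and the use of sampled returns in place of \(Q^\pi(s,a)\) are illustrative assumptions, not part of the DPG paper.

```python
import numpy as np

def softmax_policy(theta, s_feat):
    """pi_theta(a|s) for a linear-softmax policy; theta: (num_actions, feat_dim)."""
    logits = theta @ s_feat                 # one logit per action
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s_feat, a):
    """grad_theta log pi_theta(a|s) for the linear-softmax policy."""
    p = softmax_policy(theta, s_feat)
    grad = -np.outer(p, s_feat)             # -pi(a'|s) * s_feat for every action a'
    grad[a] += s_feat                       # +s_feat for the action actually taken
    return grad

def policy_gradient_estimate(theta, trajectories):
    """Monte-Carlo estimate of E[grad log pi_theta(a|s) * Q^pi(s,a)],
    using the sampled return-to-go as a stand-in for Q^pi."""
    g = np.zeros_like(theta)
    n = 0
    for traj in trajectories:               # traj: list of (s_feat, a, return_to_go)
        for s_feat, a, ret in traj:
            g += grad_log_pi(theta, s_feat, a) * ret
            n += 1
    return g / max(n, 1)
```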

For a deterministic policy \(a=\mu_\theta(s)\), \(J(\mu_\theta)=\int_\mathcal{S}\rho^\mu(s)r(s,\mu_\theta(s))\mathrm{d}s=\mathbb{E}_{s\sim\rho^\mu}[r(s,\mu_\theta(s))]\), and the corresponding deterministic policy gradient theorem is \[ \begin{align} \nabla_\theta J(\mu_\theta)&=\int_\mathcal{S}\rho^\mu(s)\nabla_\theta\mu_\theta(s)\nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}\mathrm{d}s \\ &=\mathbb{E}_{s\sim\rho^\mu}[\nabla_\theta\mu_\theta(s)\nabla_aQ^\mu(s,a)|_{a=\mu_\theta(s)}] \end{align} \] There is also a corresponding off-policy version of the theorem, which we omit here; see the original DPG paper for details.
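With a differentiable critic \(Q(s,a|\theta^Q)\), the chain-rule product \(\nabla_\theta\mu_\theta(s)\nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}\) is obtained in practice simply by backpropagating through \(Q(s,\mu_\theta(s))\). A minimal PyTorch sketch, where the network architectures and dimensions are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # illustrative sizes

# deterministic actor mu_theta(s) and critic Q(s, a)
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

states = torch.randn(32, state_dim)   # a batch of states s ~ rho^mu

# The gradient of -Q(s, mu(s)) w.r.t. the actor parameters equals
# -grad_theta mu(s) * grad_a Q(s,a)|_{a=mu(s)}, averaged over the batch,
# so minimizing this loss follows the deterministic policy gradient.
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor.zero_grad()
actor_loss.backward()                 # fills .grad on the actor parameters
```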

Deep Deterministic Policy Gradient (DDPG)

DDPG combines DPG with DQN. That is, on top of DPG it introduces the key techniques from DQN: a replay buffer to reduce correlations between samples; a target actor network and a target critic network used to compute the targets, whose parameters are updated slowly, \(\theta'\leftarrow \tau\theta +(1-\tau)\theta',\ \tau\ll1\); in addition, batch normalization is applied at the input layer. Since DDPG is off-policy, to explore the environment sufficiently we can construct an exploration policy \[ \mu'(s_t)=\mu(s_t|\theta^\mu_t)+\mathcal{N} \] where \(\mathcal{N}\) is a noise process that can be chosen to suit the problem at hand. In the DDPG paper the authors use an Ornstein-Uhlenbeck process:

For the exploration noise process we used temporally correlated noise in order to explore well in physical environments that have momentum. We used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) with θ = 0.15 and σ = 0.3. The Ornstein-Uhlenbeck process models the velocity of a Brownian particle with friction, which results in temporally correlated values centered around 0.
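A minimal sketch of such an Ornstein-Uhlenbeck noise process (discretized with a unit time step, using \(\theta=0.15\) and \(\sigma=0.3\) as quoted above and a mean of 0; the class name and interface here are my own, not from the paper):

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process:
    x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.3, dt=1.0):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        """Reset the process state at the start of each episode."""
        self.x = np.copy(self.mu)

    def sample(self):
        """Advance the process one step and return the new (correlated) noise value."""
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x
```

At action-selection time the exploratory action is then \(\mu(s_t|\theta^\mu)\) plus `noise.sample()`, clipped to the valid action range if necessary.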

The complete algorithm is as follows.


Randomly initialize critic network \(Q(s,a|\theta^Q)\) and actor \(\mu(s|\theta^\mu)\) with weights \(\theta^Q\) and \(\theta^\mu\).

Initialize target network \(Q'\) and \(\mu'\) with weights \(\theta^{Q'} \leftarrow \theta^Q\) , \(\theta^{\mu'}\leftarrow\theta^{\mu}\).

Initialize replay buffer R.

for episode = 1, M do

​ Initialize a random process \(\mathcal{N}\) for action exploration

​ Receive initial observation state \(s_1\)

for \(t=1,T\) do

​ Select action \(a_t=\mu(s_t|\theta^\mu)+\mathcal{N}_t\) according to the current policy and exploration noise

​ Execute action \(a_t\) and observe reward \(r_t\) and new state \(s_{t+1}\)

​ Store transition \((s_t,a_t,r_t,s_{t+1})\) in \(R\)

​ Sample a random minibatch of \(N\) transitions \((s_i,a_i,r_i,s_{i+1})\) from \(R\)

​ Set \(y_i=r_i+\gamma Q'(s_{i+1},\mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})\)

​ Update critic by minimizing the loss: \(L=\frac{1}{N}\sum_i(y_i-Q(s_i,a_i|\theta^Q))^2\)

​ Update the actor policy using the sampled policy gradient: \[ \begin{align*} \nabla_{\theta^\mu}J\approx\frac{1}{N}\sum_i\nabla_aQ(s,a|\theta^Q)|_{s=s_i,a=\mu(s_i)}\nabla_{\theta^\mu}\mu(s|\theta^\mu)|_{s_i} \end{align*} \]

​ Update the target networks: \[ \begin{align*} \theta^{Q'}&\leftarrow\tau\theta^Q+(1-\tau)\theta^{Q'}\\ \theta^{\mu'}&\leftarrow\tau\theta^\mu+(1-\tau)\theta^{\mu'} \end{align*} \]

end for

end for
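Putting the pieces together, here is a condensed PyTorch sketch of one DDPG update step from a sampled minibatch. The helper names, the state-action concatenation as critic input, the optimizer handling, and the omission of terminal-state handling are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_update(target, source, tau=1e-3):
    """theta' <- tau * theta + (1 - tau) * theta'."""
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=1e-3):
    """One DDPG update from a minibatch (s, a, r, s'); all entries are
    tensors with shape [N, ...], rewards shaped [N, 1]. Episode-termination
    masking is omitted for brevity."""
    s, a, r, s_next = batch

    # Critic target: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + gamma * target_critic(torch.cat([s_next, a_next], dim=1))

    # Critic loss: L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow grad_a Q(s, a)|_{a=mu(s)} * grad_theta mu(s)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Slowly track the learned networks with the target networks
    soft_update(target_critic, critic, tau)
    soft_update(target_actor, actor, tau)
```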